NSF PAR Search | NSF Public Access Repository

MLLM-CompBench: A Comparative Reasoning Benchmark for Multimodal LLMs

Kil, Jihyung; Mai, Zheda; Lee, Justin; Chowdhury, Arpita; Wang, Zihe; Cheng, Kerrie; Wang, Lemeng; Liu, Ye; Chao, Wei-Lun (December 2024, Advances in Neural Information Processing Systems 37 (NeurIPS 2024))

The ability to compare objects, scenes, or situations is crucial for effective decision-making and problem-solving in everyday life. For instance, comparing the freshness of apples enables better choices during grocery shopping, while comparing sofa designs helps optimize the aesthetics of our living space. Despite its significance, the comparative capability is largely unexplored in artificial general intelligence (AGI). In this paper, we introduce MLLM-COMPBENCH, a benchmark designed to evaluate the comparative reasoning capability of multimodal large language models (MLLMs). MLLM-COMPBENCH mines and pairs images through visually oriented questions covering eight dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. We curate a collection of around 40K image pairs using metadata from diverse vision datasets and CLIP similarity scores. These image pairs span a broad array of visual domains, including animals, fashion, sports, and both outdoor and indoor scenes. The questions are carefully crafted to discern relative characteristics between two images and are labeled by human annotators for accuracy and relevance. We use MLLM-COMPBENCH to evaluate recent MLLMs, including GPT-4V(ision), Gemini-Pro, and LLaVA-1.6. Our results reveal notable shortcomings in their comparative abilities. We believe MLLM-COMPBENCH not only sheds light on these limitations but also establishes a solid foundation for future enhancements in the comparative capability of MLLMs.
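The abstract mentions pairing images using CLIP similarity scores but does not say how the scores are thresholded or combined with dataset metadata. As a rough illustration only, the sketch below assumes each image already has an embedding vector and keeps pairs whose cosine similarity exceeds a cutoff. The function names, the toy vectors, and the 0.9 threshold are all hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mine_pairs(embeddings, threshold):
    """Return (name_a, name_b, score) for every pair scoring at or above threshold.

    `embeddings` maps an image name to its embedding vector. In practice the
    vectors would come from an image encoder such as CLIP; here they are toy
    values chosen by hand (an assumption for illustration).
    """
    pairs = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine(vec_a, vec_b)
        if score >= threshold:
            pairs.append((name_a, name_b, score))
    # Most similar pairs first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Toy embeddings: two visually similar apples and one unrelated sofa.
toy = {
    "apple_fresh.jpg": [0.9, 0.1, 0.0],
    "apple_bruised.jpg": [0.8, 0.2, 0.1],
    "sofa.jpg": [0.0, 0.1, 0.9],
}
pairs = mine_pairs(toy, threshold=0.9)
for name_a, name_b, score in pairs:
    print(f"{name_a} <-> {name_b}: {score:.3f}")
```

With these toy vectors only the two apple images pass the cutoff, which mirrors the idea of pairing images that are similar enough for a relative question (for example, which apple is fresher) to be meaningful.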
Full Text Available
COMPBENCH: A Comparative Reasoning Benchmark for Multimodal LLMs

Kil, Jihyung; Mai, Zheda; Lee, Justin; Wang, Zihe; Cheng, Kerrie; Wang, Lemeng; Liu, Ye; Chowdhury, Arpita; Chao, Wei-Lun (December 2024, NeurIPS)

Full Text Available
Dual-View Visual Contextualization for Web Navigation

https://doi.org/10.1109/CVPR52733.2024.01369

Kil, Jihyung; Song, Chan Hee; Zheng, Boyuan; Deng, Xiang; Su, Yu; Chao, Wei-Lun (June 2024, IEEE)

Full Text Available
PreSTU: Pre-Training for Scene-Text Understanding

https://doi.org/10.1109/ICCV51070.2023.01401

Kil, Jihyung; Changpinyo, Soravit; Chen, Xi; Hu, Hexiang; Goodman, Sebastian; Chao, Wei-Lun; Soricut, Radu (October 2023, IEEE)

Full Text Available
One Step at a Time: Long-Horizon Vision-and-Language Navigation with Milestones

https://doi.org/10.1109/CVPR52688.2022.01504

Song, Chan Hee; Kil, Jihyung; Pan, Tai-Yu; Sadler, Brian M.; Chao, Wei-Lun; Su, Yu (June 2022, Conference on Computer Vision and Pattern Recognition)

Full Text Available
One Step at a Time: Long-Horizon Vision-and-Language Navigation With Milestones

Song, Chan Hee; Kil, Jihyung; Pan, Tai-Yu; Sadler, Brian M.; Chao, Wei-Lun; Su, Yu (January 2022, IEEE / CVF Computer Vision and Pattern Recognition Conference)

We study the problem of developing autonomous agents that can follow human instructions to infer and perform a sequence of actions to complete the underlying task. Significant progress has been made in recent years, especially for tasks with short horizons. However, when it comes to long-horizon tasks with extended sequences of actions, an agent can easily ignore some instructions or get stuck in the middle of the long instructions and eventually fail the task. To address this challenge, we propose a model-agnostic milestone-based task tracker (M-TRACK) to guide the agent and monitor its progress. Specifically, we propose a milestone builder that tags the instructions with navigation and interaction milestones which the agent needs to complete step by step, and a milestone checker that systematically checks the agent’s progress in its current milestone and determines when to proceed to the next. On the challenging ALFRED dataset, our M-TRACK leads to a notable 33% and 52% relative improvement in unseen success rate over two competitive base models.
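The builder-and-checker idea described in this abstract can be pictured with a small sketch. This is not the authors' implementation: the class names, the `(kind, target)` tagging format, and the exact-match completion rule are assumptions made purely for illustration, and the abstract does not specify how milestones are extracted from instructions or how completion is detected.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Milestone:
    kind: str    # "navigation" or "interaction"
    target: str  # location or object the milestone refers to


def build_milestones(tagged_steps: List[Tuple[str, str]]) -> List[Milestone]:
    """Hypothetical milestone builder: turns pre-tagged steps into milestones.

    A real builder would tag the natural-language instructions itself;
    here the tags are supplied by hand.
    """
    return [Milestone(kind, target) for kind, target in tagged_steps]


@dataclass
class MilestoneTracker:
    milestones: List[Milestone]
    current: int = 0  # index of the milestone the agent is working on

    def active(self) -> Optional[Milestone]:
        """Return the current milestone, or None once all are complete."""
        if self.current >= len(self.milestones):
            return None
        return self.milestones[self.current]

    def check(self, event_kind: str, event_target: str) -> bool:
        """Hypothetical milestone checker: advance only if the observed
        event matches the active milestone exactly."""
        milestone = self.active()
        if milestone is None:
            return False
        if milestone.kind == event_kind and milestone.target == event_target:
            self.current += 1
            return True
        return False


tracker = MilestoneTracker(build_milestones([
    ("navigation", "fridge"),
    ("interaction", "apple"),
    ("navigation", "table"),
]))

skipped = tracker.check("interaction", "apple")   # out of order, not accepted
first = tracker.check("navigation", "fridge")     # completes milestone 0
second = tracker.check("interaction", "apple")    # completes milestone 1
print(tracker.active())
```

Because the tracker only exposes one active milestone at a time, an agent guided by it cannot skip ahead in a long instruction sequence, which is the failure mode the abstract highlights.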
Full Text Available
